Speed up fill_overpass_data #248
Conversation
@FlxPo Thanks for the PR. It's definitely a good idea, but unfortunately it doesn't generally speed things up at all in the full context of the library (osmdata):

library (osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
doc <- "test.osm"
if (!file.exists (doc)) {
q <- opq ("soest germany") |> # or wherever, but that's around 130,000 lines so big enough
add_osm_feature (key="highway")
osmdata_xml (q, filename="test.osm", quiet = FALSE)
}
# f1: current approach - parse the XML document once, then convert the
# parsed document back to a character string:
f1 <- function (obj, doc) {
    doc <- xml2::read_xml (doc)
    obj <- get_metadata (obj, doc)
    doc <- as.character (doc)
    list (obj, doc)
}
# f2: approach in this PR - parse the XML document for the metadata,
# but read the raw character string directly from disk with readChar():
f2 <- function (obj, doc) {
    nch <- file.info (doc)$size
    doc_xml <- xml2::read_xml (doc)
    obj <- get_metadata (obj, doc_xml)
    doc <- readChar (doc, nch)
    list (obj, doc)
}
obj <- osmdata ()
bench::mark (f1 = f1 (obj, doc),
             f2 = f2 (obj, doc))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f1            163ms    164ms      5.95        NA     0
#> 2 f2            252ms    426ms      2.35        NA     2.35

Created on 2021-11-04 by the reprex package (v2.0.1.9000)

The key statistic in the table is the "median" time, which is more than twice as slow via this PR. In spite of that, I do suspect you're on to something here. The reason it is so much slower is that you're effectively reading the document twice: once via xml2::read_xml() to extract the metadata, and a second time via readChar() (see the sketch at the end of this comment).

That said, the extraction of metadata is currently done entirely via XML nodes. Hand-coding a version for just the bits we need here would be possible, but could only really be done efficiently in C/C++, and even then would verge on re-inventing the wheel. Moreover, this reading is a relatively small portion of the total processing time for large queries, so it would be a lot of work for very little gain.

For those reasons, I think this PR should just be closed. Sorry about that, especially acknowledging that it's your first PR, but rest assured that it really does reflect a good idea - just not one that is ready to be implemented here. I'll wait to hear back from you before closing. Thanks
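PS. For reference, a minimal sketch of the obvious middle ground - reading the raw text from disk once only, and handing that same in-memory string to the parser - might look like the following (f3 is just an illustrative name, and this assumes, as above, that get_metadata() accepts a parsed xml2 document; xml2::read_xml() parses literal XML when given a string starting with "<"):

f3 <- function (obj, doc) {
    nch <- file.info (doc)$size
    txt <- readChar (doc, nch)       # read the raw document from disk once only
    doc_xml <- xml2::read_xml (txt)  # parse from the in-memory string, not from the file
    obj <- get_metadata (obj, doc_xml)
    list (obj, txt)
}

Whether that actually helps would depend on how expensive the parse itself is relative to the disk read, so it would still need benchmarking.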
Thanks for testing, I should have provided a benchmark! I'm surprised by your results: I'm getting the opposite conclusion when I run your code (slightly modified to avoid a weird "Error: Each result must equal the first result" message).
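Something like the following should be enough to silence that check (bench::mark() compares the return values of all expressions by default, and f1 and f2 return slightly different doc strings):

bench::mark (f1 = f1 (obj, doc),
             f2 = f2 (obj, doc),
             check = FALSE)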
I'm running this with R 4.1.1 (Windows 64-bit, 6 cores, 32 GB of RAM). I also tried f1 and f2 on a much bigger file (360 MB): f2 completes after a few seconds, but f1 runs forever. I tried to use bench::mark() to get precise estimates, but eventually killed it, as I need the processing power for other tasks at the moment. I will try to complete the benchmark another time.
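In the meantime, a single-shot timing along these lines is enough to see the difference on the big file (big_doc being a stand-in path for the 360 MB extract):

big_doc <- "big_extract.osm"     # hypothetical path to the 360 MB file
system.time (f2 (obj, big_doc))  # returns after a few seconds
system.time (f1 (obj, big_doc))  # effectively never returns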
PR for issue #247
(this is my first PR ever, tell me if there's anything missing!)